Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset. In simpler terms, it helps us summarize the information contained in many variables into a smaller number of “principal components” while retaining as much of the original variation as possible.
Think of PCA as a way to find the most important patterns in your data. It’s like identifying the main ingredients that contribute to the flavor of a complex dish.
The iris dataset is a classic dataset in statistics and machine learning. It contains measurements of 150 iris flowers from three different species (setosa, versicolor, and virginica). For each flower, four features were measured: 1. Sepal length 2. Sepal width 3. Petal length 4. Petal width
Let’s analyze this dataset using PCA to see if we can reduce these four dimensions while still capturing the important patterns.
# The iris dataset is built into R, so we don't need to load it
head(iris) # Look at the first few rows## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
# Create a pairs plot to see relationships between variables
pairs(iris[1:4], col = iris$Species,
main = "Iris Data by Species",
pch = 21, bg = c("red", "green", "blue")[unclass(iris$Species)])# Perform PCA on numeric variables (columns 1-4)
# We scale the data (standardize) because the measurements are in different units
iris_pca <- prcomp(iris[, 1:4], scale = TRUE)
# Examine the PCA results
summary(iris_pca)## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
## Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
The summary(iris_pca) gives us: - Standard deviation of
each principal component - Proportion of variance explained by each
component - Cumulative proportion of variance explained
The iris_pca$rotation shows us how much each original
variable contributes to each principal component. These values are
called “loadings.”
# Create a dataframe with the principal components
iris_pca_data <- data.frame(
PC1 = iris_pca$x[, 1],
PC2 = iris_pca$x[, 2],
Species = iris$Species
)
# Plot the first two principal components with points colored by species
ggplot(iris_pca_data, aes(x = PC1, y = PC2, color = Species)) +
geom_point(size = 3, alpha = 0.7) +
theme_minimal() +
labs(title = "PCA of Iris Dataset",
x = "Principal Component 1",
y = "Principal Component 2")# Create a scree plot (variance explained by each PC)
var_explained <- iris_pca$sdev^2 / sum(iris_pca$sdev^2)
var_df <- data.frame(
PC = factor(1:4, levels = 1:4),
Variance = var_explained
)
ggplot(var_df, aes(x = PC, y = Variance)) +
geom_col(fill = "steelblue") +
geom_text(aes(label = round(Variance, 2)), vjust = -0.5) +
theme_minimal() +
labs(title = "Variance Explained by Principal Components",
x = "Principal Component",
y = "Proportion of Variance Explained")# Visualize the loadings (contributions of original variables)
loadings_df <- data.frame(
Variable = rownames(iris_pca$rotation),
PC1 = iris_pca$rotation[, 1],
PC2 = iris_pca$rotation[, 2]
)
ggplot(loadings_df, aes(x = PC1, y = PC2)) +
geom_segment(aes(x = 0, y = 0, xend = PC1, yend = PC2),
arrow = arrow(length = unit(0.2, "cm")), color = "red") +
geom_text(aes(label = Variable), vjust = -0.5) +
theme_minimal() +
xlim(-1, 1) + ylim(-1, 1) +
labs(title = "Loadings Plot: Contribution of Original Variables",
x = "Principal Component 1",
y = "Principal Component 2") +
geom_vline(xintercept = 0, linetype = "dashed") +
geom_hline(yintercept = 0, linetype = "dashed")From the PCA of the iris dataset, we typically find:
First principal component (PC1) often explains about 70-80% of the variance and is strongly influenced by petal length, petal width, and sepal length. This suggests these three measurements tend to vary together.
Second principal component (PC2) often explains about 15-20% of the variance and is primarily influenced by sepal width, which varies somewhat independently from the other measurements.
The scatter plot of PC1 vs. PC2 shows clear separation between the three species, with Setosa forming a distinct cluster and some overlap between Versicolor and Virginica.
With just two principal components, we can capture about 95% of the variance in the original four-dimensional dataset.
Remote sensing data typically contains many bands (wavelengths) for each pixel, making it high-dimensional. PCA is commonly used to reduce this dimensionality, enhance features, and compress the data while preserving important information.
For this example, we’ll use real-world Landsat 5 TM (Thematic Mapper) remote sensing data that’s available in R packages. Landsat 5 TM is a multispectral scanner with 7 bands covering visible, near-infrared, and shortwave infrared portions of the electromagnetic spectrum, making it ideal for PCA analysis.
# Load the sample Landsat dataset - a subset of a Landsat 5 TM scene of Austinburg, OH
data(list = "landsat") # This loads the 'lsat' RasterBrick object
# Examine the dataset
lsat## class : SpatRaster
## dimensions : 310, 287, 7 (nrow, ncol, nlyr)
## resolution : 30, 30 (x, y)
## extent : 619395, 628005, -419505, -410205 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=22 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
## source(s) : memory
## names : B1_dn, B2_dn, B3_dn, B4_dn, B5_dn, B6_dn, ...
## min values : 54, 18, 11, 4, 2, 131, ...
## max values : 185, 87, 92, 127, 148, 146, ...
# This should show 6 bands from Landsat 5 TM:
# 1=blue (band 1, 0.45-0.52 μm), 2=green (band 2, 0.52-0.60 μm), 3=red (band 3, 0.63-0.69 μm),
# 4=near infrared (band 4, 0.76-0.90 μm), 5=shortwave infrared 1 (band 5, 1.55-1.75 μm),
# 7=shortwave infrared 2 (band 7, 2.08-2.35 μm)
# Note: Landsat 5 TM band 6 (thermal infrared, 10.40-12.50 μm) is often omitted in standard analysis
# because it measures emitted energy rather than reflected energy
# Display band information
print(paste("Number of bands:", lsat@ptr$nlyr()))## [1] "Number of bands: 7"
## [1] "Spatial resolution: 30 meters" "Spatial resolution: 30 meters"
## [1] "Dimensions (rows, cols): 310 x 287"
Landsat 5 was operational from 1984 to 2013 (one of the longest-running Earth observation satellites), and its Thematic Mapper (TM) sensor provides excellent data for PCA analysis because:
Each band is sensitive to different Earth surface features: - Bands 1-3 (visible light): Good for cultural features, vegetation types, and water clarity - Band 4 (NIR): Excellent for vegetation analysis and biomass content - Bands 5 & 7 (SWIR): Valuable for geology, soil moisture, and vegetation moisture content
# Create an RGB true color composite (using Red, Green, Blue bands)
plotRGB(lsat, r = 3, g = 2, b = 1, stretch = "lin", main = "True Color Composite")# Create a false color composite (using NIR, Red, Green bands)
# This highlights vegetation in red
plotRGB(lsat, r = 4, g = 3, b = 2, stretch = "lin", main = "False Color Composite (NIR)")# Create another false color composite using SWIR, NIR, and Red
# This can highlight moisture content and geology
plotRGB(lsat, r = 5, g = 4, b = 3, stretch = "lin", main = "False Color Composite (SWIR)")# Perform PCA on the raster stack
landsat_pca <- rasterPCA(lsat)
# Examine the summary of the PCA results
summary(landsat_pca$model)## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 34.5862074 12.0022196 2.981810357 1.292922722
## Proportion of Variance 0.8835812 0.1064054 0.006567508 0.001234769
## Cumulative Proportion 0.8835812 0.9899866 0.996554106 0.997788875
## Comp.5 Comp.6 Comp.7
## Standard deviation 1.0982925563 1.0307492287 0.8513311231
## Proportion of Variance 0.0008909979 0.0007847776 0.0005353497
## Cumulative Proportion 0.9986798726 0.9994646503 1.0000000000
# Look at the loadings (how much each original band contributes to each PC)
print(landsat_pca$model$loadings)##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## B1_dn 0.221 0.707 0.334 0.387 0.348 0.258
## B2_dn 0.155 0.407 -0.197 0.102 -0.235 -0.838
## B3_dn 0.273 0.401 -0.324 -0.405 -0.554 0.431
## B4_dn 0.755 -0.613 0.195
## B5_dn 0.624 0.589 -0.368 0.323 -0.144
## B6_dn 0.108 -0.840 0.157 0.500
## B7_dn 0.178 0.345 0.180 -0.734 0.494 -0.184
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143 0.143
## Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857 1.000
# Find out how much variance is explained by each component
explained_var <- landsat_pca$model$sdev^2 / sum(landsat_pca$model$sdev^2)
print(paste("Variance explained by each PC:", paste(round(explained_var * 100, 2), "%", collapse = ", ")))## [1] "Variance explained by each PC: 88.36 %, 10.64 %, 0.66 %, 0.12 %, 0.09 %, 0.08 %, 0.05 %"
# Plot the first four principal components
par(mfrow = c(2, 2))
plot(landsat_pca$map$PC1, main = "Principal Component 1", col = gray.colors(100))
plot(landsat_pca$map$PC2, main = "Principal Component 2", col = gray.colors(100))
plot(landsat_pca$map$PC3, main = "Principal Component 3", col = gray.colors(100))
plot(landsat_pca$map$PC4, main = "Principal Component 4", col = gray.colors(100))par(mfrow = c(1, 1))
# Create an RGB composite using the first three PCs
plotRGB(landsat_pca$map, r = 1, g = 2, b = 3, stretch = "lin",
main = "RGB Composite of First Three Principal Components")# Plot the variance explained by each PC (scree plot)
barplot(explained_var,
names.arg = paste0("PC", 1:length(explained_var)),
main = "Variance Explained by Principal Components",
xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
col = "steelblue")
text(x = 1:length(explained_var),
y = explained_var + 0.02,
labels = paste0(round(explained_var * 100, 1), "%"),
cex = 0.8)# Add the cumulative variance
cumulative_var <- cumsum(explained_var)
plot(cumulative_var, type = "b", xlab = "Number of Principal Components",
ylab = "Cumulative Proportion of Variance Explained",
main = "Cumulative Variance Explained",
ylim = c(0, 1))
abline(h = 0.95, col = "red", lty = 2) # Line at 95% variance explained
text(x = which(cumulative_var >= 0.95)[1],
y = 0.95 + 0.03,
labels = paste0("95% variance with ", which(cumulative_var >= 0.95)[1], " PCs"),
cex = 0.8, col = "red")# Create a function to enhance the visual appearance of an image
# This version works with both older RasterLayer objects and newer SpatRaster objects
enhance_image <- function(img) {
# Get min and max values in a way that works with different raster types
if(inherits(img, "SpatRaster")) {
# For terra package SpatRaster objects
min_val <- terra::minmax(img)[1]
max_val <- terra::minmax(img)[2]
} else {
# For raster package RasterLayer objects
min_val <- raster::minValue(img)
max_val <- raster::maxValue(img)
}
# Apply normalization for better contrast
return((img - min_val) / (max_val - min_val))
}
# Alternative enhance_image function if the above doesn't work
# This uses base R operations that should work on most raster objects
enhance_image_alt <- function(img) {
# Convert to matrix first
if(inherits(img, "SpatRaster") || inherits(img, "RasterLayer")) {
img_mat <- as.matrix(img)
} else {
img_mat <- img
}
# Get min and max values
min_val <- min(img_mat, na.rm = TRUE)
max_val <- max(img_mat, na.rm = TRUE)
# Normalize
img_norm <- (img_mat - min_val) / (max_val - min_val)
# Convert back to raster if needed
if(inherits(img, "SpatRaster")) {
return(terra::rast(img_norm, extent=terra::ext(img), crs=terra::crs(img)))
} else if(inherits(img, "RasterLayer")) {
return(raster::raster(img_norm, template=img))
} else {
return(img_norm)
}
}
# Try to create enhanced versions of some original bands for comparison
# Try the first method and if it fails, use the alternative
tryCatch({
enhanced_nir <- enhance_image(lsat[[4]]) # NIR band
enhanced_swir <- enhance_image(lsat[[5]]) # SWIR band
enhanced_pc1 <- enhance_image(landsat_pca$map$PC1)
enhanced_pc2 <- enhance_image(landsat_pca$map$PC2)
}, error = function(e) {
message("Using alternative enhancement method due to error: ", e$message)
enhanced_nir <- enhance_image_alt(lsat[[4]]) # NIR band
enhanced_swir <- enhance_image_alt(lsat[[5]]) # SWIR band
enhanced_pc1 <- enhance_image_alt(landsat_pca$map$PC1)
enhanced_pc2 <- enhance_image_alt(landsat_pca$map$PC2)
})
# Plot the comparison
par(mfrow = c(2, 2))
plot(enhanced_nir, main = "Original NIR Band", col = gray.colors(100))
plot(enhanced_swir, main = "Original SWIR Band", col = gray.colors(100))
plot(enhanced_pc1, main = "Principal Component 1", col = gray.colors(100))
plot(enhanced_pc2, main = "Principal Component 2", col = gray.colors(100))
# Plot the comparison
par(mfrow = c(2, 2))
plot(enhanced_nir, main = "Original NIR Band", col = gray.colors(100))
plot(enhanced_swir, main = "Original SWIR Band", col = gray.colors(100))
plot(enhanced_pc1, main = "Principal Component 1", col = gray.colors(100))
plot(enhanced_pc2, main = "Principal Component 2", col = gray.colors(100))par(mfrow = c(1, 1))
# Create a composite that would highlight specific features
# For example, PC1 often represents overall brightness/albedo
# PC2 might represent vegetation vs. non-vegetation contrast
# PC3 might highlight urban areas or water features
plotRGB(landsat_pca$map, r = 3, g = 2, b = 1, stretch = "lin",
main = "Alternative PC Composite (PC3, PC2, PC1)")In this real-world Landsat 5 TM data, the PCA results can be interpreted as:
First principal component (PC1) typically captures the overall brightness or albedo of the scene (around 70-80% of the variance). It shows the major landscape features and topography. In Landsat imagery, this often correlates with the general landscape structure.
Second principal component (PC2) often highlights the contrast between vegetation (high NIR reflectance) and non-vegetation areas (around 10-15% of the variance). It may emphasize agricultural fields, forests, and urban areas differently.
Third and fourth components (PC3, PC4) reveal more subtle features that aren’t obvious in the original bands, such as different vegetation types, soil moisture variations, or specific land cover types. They might highlight water bodies, certain crop types, or geological features.
The later components (PC5, PC6, PC7) typically contain very little of the total variance (often <1-2%) and mostly show noise, although they occasionally reveal subtle features of interest or specific anomalies in the landscape.
Data compression: Instead of working with all 6 Landsat 5 TM bands, you can often work with just 2-3 PCs that explain 95% or more of the variance. This significantly reduces storage requirements and processing time.
Feature enhancement: The PCA transformation often reveals features that aren’t clearly visible in the original bands. For example, PC2 or PC3 might highlight subtle geological features, urban structures, or different crop types that blend together in standard composites.
Change detection: By performing PCA on multi-temporal Landsat images (e.g., before and after a forest fire or urban development), changes over time can be highlighted more effectively than with original bands.
Noise reduction: Since later PCs (PC5, PC6, PC7) often contain mostly noise in Landsat data, you can reconstruct cleaner imagery by using only the first few PCs.
Land cover classification: PCA can improve classification accuracy by reducing band correlation and highlighting the most important information, making it easier to distinguish between similar land cover types.
To further explore PCA with Landsat 5 TM data, you could:
Compare seasonal changes: Apply PCA to Landsat images from different seasons to see how the landscape components change.
Analyze specific land cover types: Extract subsets of the image (e.g., just forest areas or agricultural fields) and perform PCA on those to detect subtle variations within a single land cover type.
Create specialized indices: Combine PCs to create specialized indices for detecting specific features of interest, similar to how normalized difference vegetation index (NDVI) works with original bands.
# Example of creating an index from PCs
# This is hypothetical - the actual usefulness would depend on your specific data
pc_index <- (landsat_pca$map$PC2 - landsat_pca$map$PC3) / (landsat_pca$map$PC2 + landsat_pca$map$PC3)
plot(pc_index, main = "PC-based Index", col = terrain.colors(100))As shown in the document example, you can create various interesting visualizations by combining different principal components in RGB composites. Since PC1 often represents overall brightness/albedo, PC2 typically highlights vegetation vs. non-vegetation contrast, and PC3 might highlight urban areas or water features, different combinations can reveal different aspects of the landscape.
PCA is a powerful technique for dimensional reduction and feature extraction in various fields, from simple datasets like Iris to complex remote sensing imagery like Landsat 5 TM data. By reducing the dimensionality of your data while preserving most of the important information, PCA can help you:
In the Landsat 5 TM example, we saw how PCA can transform 6 spectral bands into a smaller set of uncorrelated components that still retain most of the information. The first few principal components typically capture the major landscape features, vegetation patterns, and geological structures, while later components may reveal subtle features or isolate noise.
Remember that while PCA is powerful, it has limitations: - It assumes linear relationships between variables - It’s sensitive to outliers - The principal components may not always be easily interpretable
For beginners working with remote sensing data, it’s especially important to compare the PCA results with the original bands to build an understanding of what information each principal component represents in the physical landscape.
PCA is not just a mathematical technique but a practical tool that can enhance your ability to extract meaningful information from complex multispectral data like Landsat imagery.